Abstract

We address a question answering task on real-world images
that is set up as a Visual Turing Test. By combining the
latest advances in image representation and natural language
processing, we propose Neural-Image-QA, an end-to-end
formulation of this problem in which all parts are trained
jointly. In contrast to previous efforts, we are facing a
multi-modal problem where the language output (answer) is
conditioned on visual and natural language input (image and
question). Our approach Neural-Image-QA doubles the
performance of the previous best approach on this problem.
We provide additional insights into the problem by analyzing
how much information is contained only in the language part,
for which we provide a new human baseline. To study human
consensus, which is related to the ambiguities inherent in
this challenging task, we propose two novel metrics and
collect additional answers, which extend the original
DAQUAR dataset to DAQUAR-Consensus.

1. Introduction 

With the advances of natural language processing and image
understanding, more complex and demanding tasks have come
within reach. Our aim is to take advantage of the most recent
developments to push the state of the art for answering
natural language questions on real-world images. This task
unites inference of question intent and visual scene
understanding with a word sequence prediction task.

Most recently, architectures based on the idea of layered,
end-to-end trainable artificial neural networks have improved
the state of the art across a wide range of diverse tasks.
Most prominently, Convolutional Neural Networks have raised
the bar on image classification tasks [16] and Long Short
Term Memory Networks are dominating performance on a range of
sequence prediction tasks such as machine translation [28].

Very recently these two trends of employing neural
architectures have been combined fruitfully with methods that
can generate image [12] and video descriptions [30]. Both
condition on the visual features that stem from deep learning
architectures and employ recurrent neural network approaches
to produce descriptions.

Figure 1. Our approach Neural-Image-QA to question answering
with a Recurrent Neural Network using Long Short Term Memory
(LSTM). To answer a question about an image, we feed both the
image (CNN features) and the question (green boxes) into the
LSTM. After the (variable length) question is encoded, we
generate the answers (multiple words, orange boxes). During
the answer generation phase the previously predicted answers
are fed into the LSTM until the (END) symbol is predicted.

To further push the boundaries and explore the limits of deep
learning architectures, we propose an architecture for
answering questions about images. In contrast to prior work,
this task needs conditioning on language as well as visual
input. Both modalities have to be interpreted and jointly
represented, as an answer depends on the inferred meaning of
the question and the image content.

While there is a rich body of work on natural language
understanding that has addressed textual question answering
tasks based on semantic parsing, symbolic representation and
deduction systems, which also has seen applications to
question answering on images [20], there is initial evidence
that deep architectures can indeed achieve a similar goal
[33]. This motivates our work to seek end-to-end
architectures that learn to answer questions in a single
holistic and monolithic model.

We propose Neural-Image-QA, an approach to question answering
with a recurrent neural network. An overview is given in
Figure 1. The image is analyzed via a Convolutional Neural
Network (CNN) and the question together with the visual
representation is fed into a Long Short Term Memory (LSTM)
network. The system is trained to produce the correct answer
to the question on the image. CNN and LSTM are trained
jointly and end-to-end starting from words and pixels.

Contributions: We propose a novel approach based on recurrent
neural networks for the challenging task of answering
questions about images. It combines a CNN with an LSTM into
an end-to-end architecture that predicts answers conditioned
on a question and an image. Our approach significantly
outperforms prior work on this task, doubling the
performance. We collect additional data to study human
consensus on this task, propose two new metrics sensitive to
these effects, and provide a new baseline by asking humans to
answer the questions without observing the image. We
demonstrate a variant of our system that also answers
questions without accessing any visual information, which
beats the human baseline.

2. Related Work 

As our method touches upon different areas in machine 
learning, computer vision and natural language processing, 
we have organized related work in the following way: 

Convolutional Neural Networks for visual recognition. 

We are building on the recent success of Convolutional Neural
Networks (CNN) for visual recognition [16, 17, 25], which are
directly learnt from the raw image data and pre-trained on
large image corpora. Due to the rapid progress in this area
within the last two years, a rich set of models [27, 29] is
at our disposal.

Recurrent Neural Networks (RNN) for sequence modeling.
Recurrent Neural Networks allow Neural Networks to handle
sequences of flexible length. A particular variant called
Long Short Term Memory (LSTM) [9] has shown recent success on
natural language tasks such as machine translation [3, 28].

Combining RNNs and CNNs for description of visual content.
The task of describing visual content like still images as
well as videos has been successfully addressed with a
combination of the previous two ideas [5, 12, 31, 32, 37].
This is achieved by using an RNN-type model that first gets
to observe the visual content and is trained to afterwards
predict a sequence of words that is a description of the
visual content. Our work extends this idea to question
answering, where we formulate a model trained to generate an
answer based on visual as well as natural language input.

Grounding of natural language and visual concepts. 

Dealing with natural language input involves the association
of words with meaning. This is often referred to as the
grounding problem, in particular if the “meaning” is
associated with a sensory input. While such problems have
been historically addressed by symbolic semantic parsing
techniques [15, 22], there is a recent trend of machine
learning-based approaches [12, 13, 14] to find the
associations. Our approach follows the idea that we do not
enforce or evaluate any particular representation of
“meaning” on the language or image modality. We treat this as
latent and leave it to the joint training approach to
establish an appropriate internal representation for the
question answering task.

Textual question answering. Answering purely textual
questions has been studied in the NLP community [2, 18], and
state-of-the-art techniques typically employ semantic parsing
to arrive at a logical form capturing the intended meaning
and infer relevant answers. Only very recently, the success
of the previously mentioned neural sequence models such as
RNNs has carried over to this task [10, 33]. More
specifically, [10] uses a dependency-tree Recursive NN
instead of an LSTM, and reduces the question-answering
problem to a classification task. Moreover, according to [10]
their method cannot be easily applied to vision. [33]
proposes a different kind of network, memory networks, and it
is unclear how to apply [33] to take advantage of the visual
content. However, neither [10] nor [33] shows an end-to-end,
monolithic approach that produces multiple-word answers for
questions on images.

Visual Turing Test. Most recently, several approaches have
been proposed to approach the Visual Turing Test [21], i.e.
answering questions about visual content. For instance, [8]
have proposed a binary (yes/no) version of the Visual Turing
Test on synthetic data. In [20], we present a question
answering system based on a semantic parser on a more varied
set of human question-answer pairs. In contrast, in this
work, our method is based on a neural architecture, which is
trained end-to-end and therefore liberates the approach from
any ontological commitment that would otherwise be introduced
by a semantic parser.

We would like to note that shortly after this work, several
neural-based models [24, 19, 7] have also been suggested.
Also, several new datasets for Visual Turing Tests have just
been proposed [1, 35] that are worth further investigation.

3. Approach 

Answering questions on images is the problem of predicting an
answer $a$ given an image $x$ and a question $q$ according to
a parametric probability measure:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} p(a \mid x, q; \theta) \qquad (1)$$

where $\theta$ represents a vector of all parameters to learn
and $\mathcal{A}$ is a set of all answers. Later we describe
how we represent $x$, $a$, $q$, and
$p(\cdot \mid x, q; \theta)$ in more detail.


Figure 2. Our approach Neural-Image-QA, see Section 3 for
details.


In our scenario questions can have multiple word answers and
we consequently decompose the problem to predicting a set of
answer words
$a_{q,x} = \{a_1, a_2, \ldots, a_{\mathcal{N}(q,x)}\}$,
where $a_t$ are words from a finite vocabulary $V'$, and
$\mathcal{N}(q,x)$ is the number of answer words for the
given question and image. In our approach, named
Neural-Image-QA, we propose to tackle the problem as follows.
To predict multiple words we formulate the problem as
predicting a sequence of words from the vocabulary
$V := V' \cup \{\$\}$, where the extra token \$ indicates the
end of the answer sequence and points out that the question
has been fully answered. We thus formulate the prediction
procedure recursively:

$$\hat{a}_t = \arg\max_{a \in V} p(a \mid x, q, \hat{A}_{t-1}; \theta) \qquad (2)$$

where $\hat{A}_{t-1} = \{\hat{a}_1, \ldots, \hat{a}_{t-1}\}$
is the set of previous words, with $\hat{A}_0 = \{\}$ at the
beginning, when our approach has not given any answer so far.
The approach is terminated when $\hat{a}_t = \$$. We evaluate
the method solely based on the predicted answer words,
ignoring the extra token \$. To ensure uniqueness of the
predicted answer words, as we want to predict the set of
answer words, the prediction procedure can be trivially
changed by maximizing over $V \setminus \hat{A}_{t-1}$.
However, in practice, our algorithm learns to not predict any
previously predicted words.
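
The recursive prediction procedure of Equation 2 can be illustrated with a small sketch. This is not the implementation used in the paper: `greedy_answer`, `toy_score`, and the probabilities are hypothetical names and values introduced only for illustration, and `score` stands in for the learnt distribution over the vocabulary given the image, the question, and the previously predicted words.

```python
from typing import Callable, Dict, List

END = "$"  # extra token that marks the end of the answer sequence


def greedy_answer(
    score: Callable[[List[str]], Dict[str, float]],
    max_len: int = 10,
    unique: bool = False,
) -> List[str]:
    """Greedily pick the most probable word until END is predicted."""
    answer: List[str] = []
    for _ in range(max_len):
        # Stand-in for p(a | x, q, previous words; theta) over the vocabulary.
        probs = score(answer)
        if unique:
            # Maximize over V \ A_{t-1}: drop words that were already predicted.
            probs = {w: p for w, p in probs.items() if w not in answer}
        best = max(probs, key=probs.get)
        if best == END:
            break
        answer.append(best)
    return answer


def toy_score(previous: List[str]) -> Dict[str, float]:
    """Hypothetical distribution that depends only on the step index."""
    steps = [
        {"blue": 0.6, "white": 0.3, END: 0.1},
        {"blue": 0.5, "white": 0.3, END: 0.2},
    ]
    return steps[len(previous)] if len(previous) < len(steps) else {END: 1.0}


plain = greedy_answer(toy_score)                  # may repeat a word
distinct = greedy_answer(toy_score, unique=True)  # enforces a set of words
```

With this toy distribution the unconstrained procedure repeats a word, while the constrained variant does not; the text notes that the trained model learns to avoid repetitions without the constraint.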

As shown in Figure 1 and Figure 2, we feed Neural-Image-QA
with a question as a sequence of words, i.e.
$q = [q_1, \ldots, q_{n-1}, [\![?]\!]]$, where each $q_t$ is
the $t$-th word of the question and $[\![?]\!] := q_n$
encodes the question mark, the end of the question. Since our
problem is formulated as a variable-length input/output
sequence, we model the parametric distribution
$p(\cdot \mid x, q; \theta)$ of Neural-Image-QA with a
recurrent neural network and a softmax prediction layer. More
precisely, Neural-Image-QA is a deep network built of a CNN
[17] and a Long Short Term Memory (LSTM) [9]. LSTM has been
recently shown to be effective in learning a variable-length
sequence-to-sequence mapping [5, 28].

Figure 3. LSTM unit. See Section 3, Equations (3)-(8) for details.

Both question and answer words are represented with one-hot
vector encoding (a binary vector with exactly one non-zero
entry at the position indicating the index of the word in the
vocabulary) and embedded in a lower dimensional space, using
a jointly learnt latent linear embedding. In the training
phase, we augment the question words sequence $q$ with the
corresponding ground truth answer words sequence $a$, i.e.
$\hat{q} := [q, a]$. During the test time, in the prediction
phase, at time step $t$, we augment $q$ with previously
predicted answer words
$\hat{a}_{1..t} := [\hat{a}_1, \ldots, \hat{a}_{t-1}]$, i.e.
$\hat{q}_t := [q, \hat{a}_{1..t}]$. This means the question
$q$ and the previous answers are encoded implicitly in the
hidden states of the LSTM, while the latent hidden
representation is learnt. We encode the image $x$ using a CNN
and provide it at every time step as input to the LSTM. We
set the input $v_t$ as a concatenation of $[x, \hat{q}_t]$.
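
The construction of the per-step inputs described above can be sketched as follows. This is a minimal illustration, not the paper's code: the vocabulary, the embedding matrix `E`, the one-dimensional image feature, and all function names are hypothetical, and the embedding is fixed here whereas in the paper it is learnt jointly.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def one_hot(index: int, size: int) -> Vector:
    """Binary vector with exactly one non-zero entry at `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed(word_vector: Vector, E: Matrix) -> Vector:
    """Linear embedding E @ word_vector (E has shape dim x vocabulary size)."""
    return [sum(e * w for e, w in zip(row, word_vector)) for row in E]


def build_inputs(
    image_features: Vector,
    question: List[str],
    answer: List[str],
    vocab: Dict[str, int],
    E: Matrix,
) -> List[Vector]:
    """Training-time inputs: the sequence [q, a], one vector v_t per word."""
    sequence = question + answer  # augmented question [q, a]
    inputs = []
    for word in sequence:
        word_embedding = embed(one_hot(vocab[word], len(vocab)), E)
        # v_t is the concatenation of the image features and the word embedding.
        inputs.append(image_features + word_embedding)
    return inputs


# Hypothetical toy setup: 4-word vocabulary, 2-d embedding, 1-d image feature.
vocab = {"what": 0, "color": 1, "?": 2, "blue": 3}
E = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 3.0]]
inputs = build_inputs([0.5], ["what", "color", "?"], ["blue"], vocab, E)
```

Because the image features are repeated in every `v_t`, the visual input is available at each time step, matching the description in the text.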

As visualized in detail in Figure 3, the LSTM unit takes an
input vector $v_t$ at each time step $t$ and predicts an
output word $z_t$ which is equal to its latent hidden state
$h_t$. As discussed above, $z_t$ is a linear embedding of the
corresponding answer word $a_t$. In contrast to a simple RNN
unit, the LSTM unit additionally maintains a memory cell $c$.
This allows it to learn long-term dynamics more easily and
significantly reduces the vanishing and exploding gradients
problem [9]. More precisely, we use the LSTM unit as
described in [36] and the Caffe implementation from [5]. With
the sigmoid nonlinearity $\sigma : \mathbb{R} \mapsto [0, 1]$,
$\sigma(v) = (1 + e^{-v})^{-1}$, and the hyperbolic tangent
nonlinearity $\phi : \mathbb{R} \mapsto [-1, 1]$,
$\phi(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}} = 2\sigma(2v) - 1$,
the LSTM updates for time step $t$ given inputs $v_t$,
$h_{t-1}$, and the memory cell $c_{t-1}$ as follows:


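$$i_t = \sigma(W_{vi} v_t + W_{hi} h_{t-1} + b_i) \qquad (3)$$

$$f_t = \sigma(W_{vf} v_t + W_{hf} h_{t-1} + b_f) \qquad (4)$$

$$o_t = \sigma(W_{vo} v_t + W_{ho} h_{t-1} + b_o) \qquad (5)$$

$$g_t = \phi(W_{vg} v_t + W_{hg} h_{t-1} + b_g) \qquad (6)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \qquad (7)$$

$$h_t = o_t \odot \phi(c_t) \qquad (8)$$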


where $\odot$ denotes element-wise multiplication. All the
weights $W$ and biases $b$ of the network are learnt jointly
with the cross-entropy loss. Conceptually, as shown in
Figure 3, Equation 3 corresponds to the input gate, Equation
6 to the input modulation gate, and Equation 4 to the forget
gate, which determines how much to keep from the previous
memory state $c_{t-1}$. As Figures 1 and 2 suggest, all the
output predictions that occur before the question mark are
excluded from the loss computation, so that the model is
penalized solely based on the predicted answer words.
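
A single LSTM update following Equations (3)-(8) can be written out in a few lines. The sketch below is plain Python for readability and is not the Caffe implementation used in the paper; the parameter layout (`params` mapping each gate name to its input weights, recurrent weights, and bias) and the zero-valued toy parameters are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (W_v, W_h, b) for one gate


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W_v: Matrix, v: Vector, W_h: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W_v v + W_h h + b row by row."""
    return [
        sum(w * x for w, x in zip(row_v, v))
        + sum(w * y for w, y in zip(row_h, h))
        + bias
        for row_v, row_h, bias in zip(W_v, W_h, b)
    ]


def lstm_step(
    v_t: Vector, h_prev: Vector, c_prev: Vector, params: Dict[str, GateParams]
) -> Tuple[Vector, Vector]:
    """One update of the LSTM unit; returns (h_t, c_t)."""
    pre = {
        name: affine(W_v, v_t, W_h, h_prev, b)
        for name, (W_v, W_h, b) in params.items()
    }
    i = [sigmoid(z) for z in pre["i"]]    # input gate, Eq. (3)
    f = [sigmoid(z) for z in pre["f"]]    # forget gate, Eq. (4)
    o = [sigmoid(z) for z in pre["o"]]    # output gate, Eq. (5)
    g = [math.tanh(z) for z in pre["g"]]  # input modulation gate, Eq. (6)
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]  # Eq. (7)
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]                    # Eq. (8)
    return h, c


# Toy parameters (all zeros): every gate outputs 0.5 and g_t is 0.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: (zeros, zeros, [0.0, 0.0]) for name in "ifog"}
h_t, c_t = lstm_step([1.0, -1.0], [0.0, 0.0], [1.0, 1.0], params)
```

With all-zero parameters the forget gate keeps exactly half of the previous memory, which makes the role of Equation 7 easy to check by hand.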

Implementation. We use default hyper-parameters of LSTM [5]
and CNN [11]. All CNN models are first pre-trained on the
ImageNet dataset [25], and next we randomly initialize and
train the last layer together with the LSTM network on the
task. We find this step crucial in obtaining good results. We
have explored the use of a 2-layer LSTM model, but have
consistently obtained worse performance. In a pilot study, we
have found that the GoogleNet architecture [11, 29]
consistently outperforms the AlexNet architecture [11, 16] as
a CNN model for our task and model.

4. Experiments 

In this section we benchmark our method on a task of
answering questions about images. We compare different
variants of our proposed model to prior work in Section 4.1.
In addition, in Section 4.2, we analyze how well questions
can be answered without using the image in order to gain an
understanding of biases in the form of prior knowledge and
common sense. We provide a new human baseline for this task.
In Section 4.3 we discuss ambiguities in the question
answering tasks and analyze them further by introducing
metrics that are sensitive to these phenomena. In particular,
the WUPS score [20] is extended to a consensus metric that
considers multiple human answers. Additional results are
available in the supplementary material and on the project
webpage.¹

Experimental protocol. We evaluate our approach on the DAQUAR
dataset [20], which provides 12,468 human question-answer
pairs on images of indoor scenes [26], and follow the same
evaluation protocol by providing results on accuracy and the
WUPS score at {0.9, 0.0}. We run experiments for the full
dataset as well as their proposed reduced set that restricts
the output space to only 37 object categories and uses 25
test images. In addition, we also evaluate the methods on
different subsets of DAQUAR where only 1, 2, 3 or 4 word
answers are present.

¹ https://www.d2.mpi-inf.mpg.de/visual-turing-challenge

                              Accuracy   WUPS @0.9   WUPS @0.0
Malinowski et al. [20]            7.86       11.86       38.79
Neural-Image-QA (ours)
  - multiple words               17.49       23.28       57.76
  - single word                  19.43       25.28       62.00
Human answers [20]               50.20       50.82       67.27
Language only (ours)
  - multiple words               17.06       22.30       56.53
  - single word                  17.15       22.80       58.42
Human answers, no images          7.34       13.17       35.56

Table 1. Results on DAQUAR, all classes, single reference, in %.

WUPS scores. We base our experiments as well as the consensus
metrics on WUPS scores [20]. The metric is a generalization
of the accuracy measure that accounts for word-level
ambiguities in the answer words. For instance ‘carton’ and
‘box’ can be associated with a similar concept, and hence
models should not be strongly penalized for this type of
mistake. Formally:

$$\mathrm{WUPS}(A, T) = \frac{1}{N} \sum_{i=1}^{N} \min\Big\{ \prod_{a \in A^i} \max_{t \in T^i} \mu(a, t),\ \prod_{t \in T^i} \max_{a \in A^i} \mu(a, t) \Big\}$$

To embrace the aforementioned ambiguities, [20] suggest using
a thresholded taxonomy-based Wu-Palmer similarity [34] for
$\mu$. The smaller the threshold, the more forgiving the
metric. As in [20], we report WUPS at two extremes, 0.0 and
0.9.
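
The WUPS definition can be made concrete with a short sketch. It is written under several assumptions that are not stated in this text: `toy_similarity` is a hypothetical stand-in for the Wu-Palmer similarity, the `penalty` factor applied below the threshold is our reading of the thresholding in [20], and an empty answer or reference set is scored as 0.

```python
from math import prod
from typing import Callable, List, Set

Similarity = Callable[[str, str], float]


def set_score(answer: Set[str], truth: Set[str], mu: Similarity) -> float:
    """The min{...} term of WUPS for one question."""
    if not answer or not truth:
        return 0.0  # assumption: an empty set earns no credit
    answer_side = prod(max(mu(a, t) for t in truth) for a in answer)
    truth_side = prod(max(mu(a, t) for a in answer) for t in truth)
    return min(answer_side, truth_side)


def wups(answers: List[Set[str]], truths: List[Set[str]], mu: Similarity) -> float:
    """Average the per-question scores over all N questions."""
    scores = [set_score(A, T, mu) for A, T in zip(answers, truths)]
    return sum(scores) / len(scores)


def thresholded(sim: Similarity, threshold: float, penalty: float = 0.1) -> Similarity:
    """Down-weight similarities below the threshold (penalty is an assumption)."""
    def mu(a: str, t: str) -> float:
        s = sim(a, t)
        return s if s >= threshold else penalty * s
    return mu


def toy_similarity(a: str, t: str) -> float:
    """Hypothetical similarity; a real setup would use a WordNet-based WUP."""
    if a == t:
        return 1.0
    return 0.95 if {a, t} == {"carton", "box"} else 0.0


answers = [{"box"}, {"table"}]
truths = [{"carton"}, {"table"}]
lenient = wups(answers, truths, thresholded(toy_similarity, 0.9))
strict = wups(answers, truths, thresholded(toy_similarity, 0.99))
```

Taking the minimum of the two products penalizes both answers that contain extra words and answers that miss reference words, which is why the score reduces to plain accuracy when $\mu$ is an exact-match indicator.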

4.1. Evaluation of Neural-Image-QA 

We start with the evaluation of our Neural-Image-QA on the
full DAQUAR dataset in order to study different variants and
training conditions. Afterwards we evaluate on the reduced
DAQUAR for additional points of comparison to prior work.

Results on full DAQUAR. Table 1 shows the results of our
Neural-Image-QA method on the full set (“multiple words”)
with 653 images and 5673 question-answer pairs available at
test time. In addition, we evaluate a variant that is trained
to predict only a single word (“single word”) as well as a
variant that does not use visual features (“Language only”).
In comparison to the prior work [20] (shown in the first row
in Table 1), we observe strong improvements of over 9
percentage points in accuracy and over 11 points in the WUPS
scores (second row in Table 1, which corresponds to “multiple
words”). Note that we achieve this improvement despite the
fact that the only published number available for the
comparison on the full set uses ground truth object
annotations [20], which puts our method at a disadvantage.
Further improvements are observed when we train only on a
single word answer, which doubles the accuracy obtained in
prior work. We attribute this to a joint training of the
language and visual representations and the dataset bias,
where about 90% of the answers contain only a single word.

                              Accuracy   WUPS @0.9   WUPS @0.0
Neural-Image-QA (ours)           21.67       27.99       65.11
Language only (ours)             19.13       25.16       61.51

Table 2. Results of the single word model on the one-word answers
subset of DAQUAR, all classes, single reference, in %.

Figure 4. Language only (blue bar) and Neural-Image-QA (red
bar) “multi word” models evaluated on different subsets of
DAQUAR (x-axis: number of answer words). We consider 1, 2, 3,
4 word subsets. The blue and red horizontal lines represent
“single word” variants evaluated on the answers with exactly
1 word.

We further analyze this effect in Figure 4, where we show the
performance of our approach (“multiple words”) as a function
of the number of words in the answer (truncated at 4 words
due to the diminishing performance). The performance of the
“single word” variants on the one-word subset is shown as
horizontal lines. Although accuracy drops rapidly for longer
answers, our model is capable of producing a significant
number of correct two-word answers. The “single word”
variants have an edge on the single-word answers and benefit
from the dataset bias towards this type of answer.
Quantitative results of the “single word” model on the
one-word answers subset of DAQUAR are shown in Table 2. While
we have made substantial progress compared to prior work,
there is still a margin of 30 percentage points to human
accuracy and of 25 points in WUPS score (“Human answers” in
Table 1).


Results on reduced DAQUAR. In order to provide performance
numbers that are comparable to the proposed Multi-World
approach in [20], we also run our method on the reduced set
with 37 object classes and only 25 images with 297
question-answer pairs at test time.

Table 3 shows that Neural-Image-QA also improves on the
reduced DAQUAR set, achieving 34.68% accuracy and 40.76% WUPS
at 0.9, substantially outperforming [20] by 21.95 percentage
points in accuracy and 22.6 points in WUPS. Similarly to the
previous experiments, we achieve the best performance using
the “single word” variant.

                              Accuracy   WUPS @0.9   WUPS @0.0
Malinowski et al. [20]           12.73       18.10       51.47
Neural-Image-QA (ours)
  - multiple words               29.27       36.50       79.47
  - single word                  34.68       40.76       79.54
Language only (ours)
  - multiple words               32.32       38.39       80.05
  - single word                  31.65       38.35       80.08

Table 3. Results on reduced DAQUAR, single reference, with
a reduced set of 37 object classes and 25 test images with 297
question-answer pairs, in %.

4.2. Answering questions without looking at images 

In order to study how much information is already contained
in questions, we train a version of our model that ignores
the visual input. The results are shown in Table 1 and Table
3 under “Language only (ours)”. The best “Language only”
models, with 17.15% and 32.32%, compare very well in terms of
accuracy to the best models that include vision. The latter
achieve 19.43% and 34.68% on the full and reduced set
respectively.

In order to further analyze this finding, we have collected a
new human baseline, “Human answers, no images”, where we have
asked participants to answer the DAQUAR questions without
looking at the images. It turns out that humans can guess the
correct answer in 7.34% of the cases (Table 1) by exploiting
prior knowledge and common sense. Interestingly, our best
“Language only” model outperforms the human baseline by over
9 percentage points. A substantial number of answers are
plausible and resemble a form of common sense knowledge
employed by humans to infer answers without having seen the
image.

4.3. Human Consensus 

We observe that in many cases there is inter-human agreement
in the answers for a given image and question, and this is
also reflected by the human baseline performance on the
question answering task of 50.20% (“Human answers” in Table
1). We study and analyze this effect further by extending our
dataset to multiple human reference answers in Section 4.3.1,
and proposing a new measure, inspired by the work in
psychology [4, 6, 23], that handles disagreement in Section
4.3.2, as well as conducting additional experiments in
Section 4.3.3.






































Figure 5. Study of inter-human agreement. At x-axis: no
consensus (0%), at least half consensus (50%), full consensus
(100%). Results in %. Left: consensus on the whole data,
right: consensus on the test data.

4.3.1 DAQUAR-Consensus 

In order to study the effects of consensus in the question
answering task, we have asked multiple participants to answer
the same question of the DAQUAR dataset given the respective
image. We follow the same scheme as in the original data
collection effort, where the answer is a set of words or
numbers. We do not impose any further restrictions on the
answers. This extends the original data [20] to an average of
5 test answers per image and question. We refer to this
dataset as DAQUAR-Consensus.



                              Accuracy   WUPS @0.9   WUPS @0.0
Subset: No agreement
Language only (ours)
  - multiple words                8.86       12.46       38.89
  - single word                   8.50       12.05       40.94
Neural-Image-QA (ours)
  - multiple words               10.31       13.39       40.05
  - single word                   9.13       13.06       43.48
Subset: ≥ 50% agreement
Language only (ours)
  - multiple words               21.17       27.43       66.68
  - single word                  20.73       27.38       67.69
Neural-Image-QA (ours)
  - multiple words               20.45       27.71       67.30
  - single word                  24.10       30.94       71.95
Subset: Full agreement
Language only (ours)
  - multiple words               27.86       35.26       78.83
  - single word                  25.26       32.89       79.08
Neural-Image-QA (ours)
  - multiple words               22.85       33.29       78.56
  - single word                  29.62       37.71       82.31

Table 4. Results on DAQUAR, all classes, single reference in %
(the subsets are chosen based on DAQUAR-Consensus).

4.3.2 Consensus Measures

While we have to acknowledge inherent ambiguities in our
task, we seek a metric that prefers an answer that is
commonly seen as preferred. We make two proposals:

Average Consensus: We use our new annotation set that
contains multiple answers per question in order to compute an
expected score in the evaluation:

$$\frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \min\Big\{ \prod_{a \in A^i} \max_{t \in T^i_k} \mu(a, t),\ \prod_{t \in T^i_k} \max_{a \in A^i} \mu(a, t) \Big\} \qquad (9)$$

where for the $i$-th question $A^i$ is the answer generated
by the architecture and $T^i_k$ is the $k$-th possible human
answer corresponding to the $k$-th interpretation of the
question. Both answers $A^i$ and $T^i_k$ are sets of words,
and $\mu$ is a membership measure, for instance WUP [34]. We
call this metric “Average Consensus Metric (ACM)” since, in
the limit, as $K$ approaches the total number of humans, we
truly measure the inter-human agreement of every question.

Min Consensus: The Average Consensus Metric puts more weight
on more “mainstream” answers due to the summation over
possible answers given by humans. In order to measure if the
result was in agreement with at least one human, we propose a
“Min Consensus Metric (MCM)” by replacing the averaging in
Equation 9 with a max operator. We call this metric Min
Consensus and suggest using both metrics in the benchmarks.
We will make the implementation of both metrics publicly
available.

$$\frac{1}{N} \sum_{i=1}^{N} \max_{k=1}^{K} \Big( \min\Big\{ \prod_{a \in A^i} \max_{t \in T^i_k} \mu(a, t),\ \prod_{t \in T^i_k} \max_{a \in A^i} \mu(a, t) \Big\} \Big) \qquad (10)$$

Intuitively, the max operator uses in evaluation a human
answer that is the closest to the predicted one, which
represents a minimal form of consensus.
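
The two consensus measures differ only in how the per-reference scores of one question are aggregated, which the following sketch makes explicit. It is an illustration rather than the released implementation: the function names and the exact-match similarity are hypothetical, empty sets are scored as 0 by assumption, and the average is taken per question, which coincides with Equation 9 when every question has the same number K of reference answers.

```python
from math import prod
from typing import Callable, List, Set

Similarity = Callable[[str, str], float]


def set_score(answer: Set[str], truth: Set[str], mu: Similarity) -> float:
    """The min{...} term shared by Equations 9 and 10."""
    if not answer or not truth:
        return 0.0  # assumption: an empty set earns no credit
    answer_side = prod(max(mu(a, t) for t in truth) for a in answer)
    truth_side = prod(max(mu(a, t) for a in answer) for t in truth)
    return min(answer_side, truth_side)


def average_consensus(
    answers: List[Set[str]], references: List[List[Set[str]]], mu: Similarity
) -> float:
    """ACM: average the score over the K human answers of each question."""
    per_question = [
        sum(set_score(A, T, mu) for T in Ts) / len(Ts)
        for A, Ts in zip(answers, references)
    ]
    return sum(per_question) / len(per_question)


def min_consensus(
    answers: List[Set[str]], references: List[List[Set[str]]], mu: Similarity
) -> float:
    """MCM: keep only the best-matching human answer of each question."""
    per_question = [
        max(set_score(A, T, mu) for T in Ts)
        for A, Ts in zip(answers, references)
    ]
    return sum(per_question) / len(per_question)


def exact_match(a: str, t: str) -> float:
    """Hypothetical similarity; WUP [34] would be used in practice."""
    return 1.0 if a == t else 0.0


predicted = [{"bed"}]
humans = [[{"bed"}, {"table"}, {"bed"}]]  # three reference answers, one question
acm = average_consensus(predicted, humans, exact_match)
mcm = min_consensus(predicted, humans, exact_match)
```

In this toy case two of three annotators agree with the prediction, so the average measure gives partial credit while the max-based measure gives full credit.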

4.3.3 Consensus results 

Using the multiple reference answers in DAQUAR-Consensus we
can show a more detailed analysis of inter-human agreement.
Figure 5 shows the fraction of the data where all available
answers agree (“100”), where at least 50% of the available
answers agree, and where the answers do not agree at all (no
agreement, “0”). We observe that for the majority of the
data, there is a partial agreement, but even full
disagreement is possible. We split the dataset into three
parts according to the above criteria “No agreement”, “≥ 50%
agreement” and “Full agreement” and evaluate our models on
these splits (Table 4 summarizes the results). On subsets
with stronger agreement, we achieve substantial gains of up
to 10 and 20 percentage points in accuracy over the full set
(Table 1) and the Subset: No agreement (Table 4),
respectively. These splits can be seen as curated versions of
DAQUAR, which allow studies with factored out ambiguities.

                              Accuracy   WUPS @0.9   WUPS @0.0
Average Consensus Metric
Language only (ours)
  - multiple words               11.60       18.24       52.68
  - single word                  11.57       18.97       54.39
Neural-Image-QA (ours)
  - multiple words               11.31       18.62       53.21
  - single word                  13.51       21.36       58.03
Min Consensus Metric
Language only (ours)
  - multiple words               22.14       29.43       66.88
  - single word                  22.56       30.93       69.82
Neural-Image-QA (ours)
  - multiple words               22.74       30.54       68.17
  - single word                  26.53       34.87       74.51

Table 5. Results on DAQUAR-Consensus, all classes, consensus
in %.

The aforementioned “Average Consensus Metric” generalizes the
notion of agreement, and encourages predictions of the most
agreeable answers. On the other hand, the “Min Consensus
Metric” has the desired effect of providing a more optimistic
evaluation. Table 5 shows the application of both measures to
our data and models.

Moreover, Table 6 shows that “MCM” applied to human answers
at test time captures ambiguities in interpreting questions
by improving the score of the human baseline from [20] (here,
as opposed to Table 5, we exclude the original human answers
from the measure). It also cooperates well with WUPS at 0.9,
which takes word ambiguities into account, gaining a score
about 20% higher.

4.4. Qualitative results 

We show predicted answers of different variants of our
architecture in Tables 7, 8, and 9. We have chosen the
examples to highlight differences between Neural-Image-QA and
the “Language only” variant. We use a “multiple words”
approach only in Table 8; otherwise the “single word” model
is shown. Despite some failure cases, “Language only” makes
“reasonable guesses” like predicting that the largest object
could be a table or that an object that could be found on the
bed is either a pillow or a doll.

                              Accuracy   WUPS @0.9   WUPS @0.0
WUPS [20]                        50.20       50.82       67.27
ACM (ours)                       36.78       45.68       64.10
MCM (ours)                       60.50       69.65       82.40

Table 6. Min and Average Consensus on human answers from
DAQUAR; as reference sentences we use all answers in
DAQUAR-Consensus which are not in DAQUAR, in %.

4.5. Failure cases 

While our method answers correctly on a large part of the
challenge (e.g. ≈ 35 WUPS at 0.9 on “what color” and “how
many” question subsets), spatial relations (≈ 21 WUPS at
0.9), which account for a substantial part of DAQUAR, remain
challenging. Other errors involve questions with small
objects, negations, and shapes (below 12 WUPS at 0.9). Too
few training data points for the aforementioned cases may
contribute to these mistakes.

Table 9 shows examples of failure cases that include (in
order) strong occlusion, a possible answer not captured by
our ground truth answers, and unusual instances (red
toaster).

5. Conclusions 

We have presented a neural architecture for answering natural
language questions about images that contrasts with prior
efforts based on semantic parsing and outperforms prior work
by doubling performance on this challenging task. A variant
of our model that does not use the image to answer the
question performs only slightly worse and even outperforms a
new human baseline that we have collected under the same
condition. We conclude that our model has learnt biases and
patterns that can be seen as forms of common sense and prior
knowledge that humans use to accomplish this task. We observe
that indoor scene statistics, spatial reasoning, and small
objects are not well captured by the global CNN
representation, but the true limitations of this
representation can only be explored on larger datasets. We
extended our existing DAQUAR dataset to DAQUAR-Consensus,
which now provides multiple reference answers and thereby
allows the study of inter-human agreement and consensus on
the question answering task. We propose two new metrics:
“Average Consensus”, which takes into account human
disagreement, and “Min Consensus”, which captures
disagreement in human question answering.

                    What is on the right     How many drawers         What is the largest
                    side of the cabinet?     are there?               object?
Neural-Image-QA:    bed                      3                        bed
Language only:      bed                      6                        table

Table 7. Examples of questions and answers. Correct predictions are colored in green, incorrect in red.

                    What is on the           What is the colour       What objects are
                    refrigerator?            of the comforter?        found on the bed?
Neural-Image-QA:    magnet, paper            blue, white              bed sheets, pillow
Language only:      magnet, paper            blue, green, red,        doll, pillow
                                             yellow

Table 8. Examples of questions and answers with multiple words. Correct predictions are colored in green, incorrect in red.

                         How many chairs     What is the object       Which item is red
                         are there?          fixed on the window?     in colour?
Neural-Image-QA:                             curtain                  remote control
Language only:                               curtain                  clock
Ground truth answers:                        handle                   toaster

Table 9. Examples of questions and answers - failure cases.

| | Accuracy | WUPS @0.9 | WUPS @0.0 |
|---|---|---|---|
| Malinowski et al. [1] | 7.86 | 11.86 | 38.79 |
| Neural-Image-QA (ours) | | | |
| - multiple words | 17.49 | 23.28 | 57.76 |
| - single word | 19.43 | 25.28 | 62.00 |
| Human answers [1] | 50.20 | 50.82 | 67.27 |
| Language only (ours) | | | |
| - multiple words | 17.06 | 22.30 | 56.53 |
| - single word | 17.15 | 22.80 | 58.42 |
| Human answers, no images | 7.34 | 13.17 | 35.56 |

Table 1. Results on DAQUAR, all classes, single reference, in %.
Replication of Table 1 from the main paper for convenience.

In this supplemental material, we additionally provide 
qualitative examples of different variants of our architecture 
and show the correlations of predicted answer words and 
question words with human answer and question words. 

The examples are chosen to highlight challenges as well
as differences between the “Neural-Image-QA” and “Language
only” architectures. Table 9 also shows a few failure cases.
In all cases but “multiple words answer”, we use the best
“single word” variants. Although “Language only” ignores
the image, it is still able to make “reasonable guesses” by
exploiting biases captured by the dataset that can be viewed
as a type of common sense knowledge. For instance, a “tea
kettle” often sits on the oven, cabinets are usually “brown”,
a “chair” is typically placed in front of a table, and we
commonly keep a “photo” on a cabinet (Tables 2, 4, 5, 8). This
effect is analysed in Figure 1. Each data point in the plot
represents the correlation between a question word and a
predicted answer word for our “Language only” model (x-axis)
versus the corresponding correlation in the human answers (y-axis).
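
The kind of word-level correlation described above can be sketched as follows. This is a minimal illustration, assuming the correlation is the Pearson (phi) coefficient between two binary indicators, "word occurs in the question" and "word occurs in the answer", computed over a set of question-answer pairs; the text does not spell out the exact estimator, and the helper names and toy data are hypothetical.

```python
import re
from itertools import product
from math import sqrt


def tokens(text):
    """Lower-case a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def phi(xs, ys):
    """Pearson correlation of two equal-length 0/1 sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x, var_y = mean_x * (1 - mean_x), mean_y * (1 - mean_y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a word that is always or never present carries no signal
    return cov / sqrt(var_x * var_y)


def word_correlations(pairs):
    """Map each (question word, answer word) to its correlation over all pairs."""
    tokenised = [(tokens(q), tokens(a)) for q, a in pairs]
    q_vocab = set().union(*(q for q, _ in tokenised))
    a_vocab = set().union(*(a for _, a in tokenised))
    result = {}
    for q_word, a_word in product(sorted(q_vocab), sorted(a_vocab)):
        xs = [int(q_word in q) for q, _ in tokenised]
        ys = [int(a_word in a) for _, a in tokenised]
        result[(q_word, a_word)] = phi(xs, ys)
    return result


# Toy question-answer pairs, loosely modelled on the examples in the tables.
pairs = [
    ("What is on the stove?", "tea kettle"),
    ("What color are the cabinets?", "brown"),
    ("What is on the stove?", "tea kettle"),
    ("What color is the towel?", "white"),
]
corr = word_correlations(pairs)
print(corr[("stove", "kettle")], corr[("stove", "brown")])
```

Running the same function once on model predictions and once on human answers, and pairing the two values for each (question word, answer word) key, would give the coordinates of a scatter plot of the kind shown in Figure 1.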

Despite the reasonable guesses of the “Language only”
architecture, “Neural-Image-QA” predicts better answers on
average (see Table 1, which we have replicated from the main
paper for the convenience of the reader) by exploiting the
visual content of images. For instance, in Table 6 the
“Language only” model incorrectly answers “6” to the question
“How many burner knobs are there?” because during training it
has seen only this answer for exactly the same question, asked
about a different image.

Figure 1. Correlation between question and answer words of the
“Language only” model (x-axis), and the corresponding
correlation for the “Human-baseline” [1] (y-axis).

Both models, “Language only” and “Neural-Image-QA”,
have difficulty answering long questions or questions that
expect a larger number of answer words (Table 9). On the
other hand, both models do well at predicting the type of
the answer (e.g. “what color ...” questions result in a color
name in the answer, and “how many ...” questions result in a
number); there are only a few rare cases with an incorrect
type of the predicted answer (the last example in Table 9).
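
One simple way to check for such type mismatches is sketched below. The rules are illustrative assumptions (a small hand-written color list and question prefixes chosen from the examples in the tables), not the procedure used for the analysis above.

```python
# Illustrative color vocabulary; not the dataset's actual list of color words.
COLOR_WORDS = {"black", "blue", "brown", "green", "red", "white", "yellow"}

COLOR_PREFIXES = (
    "what color", "what colour",
    "what is the color", "what is the colour",
)


def expected_type(question):
    """Guess the answer type a question asks for from its opening words."""
    q = question.lower().strip()
    if q.startswith("how many"):
        return "number"
    if q.startswith(COLOR_PREFIXES):
        return "color"
    return "object"


def answer_type(answer):
    """Classify a comma-separated answer as number, color, or object."""
    words = [w.strip() for w in answer.lower().split(",")]
    if all(w.isdigit() for w in words):
        return "number"
    if all(w in COLOR_WORDS for w in words):
        return "color"
    return "object"


def type_matches(question, answer):
    """True if the predicted answer has the type the question asks for."""
    return expected_type(question) == answer_type(answer)


print(type_matches("How many lamps are there?", "2"))
print(type_matches("What is the colour of the pillows?", "blue, green, red"))
print(type_matches(
    "What color is the frame of the mirror close to the wardrobe?", "curtain"))
```

The third call reproduces the kind of failure mentioned above: a color question answered with an object name.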

| | What are the objects close to the wall? | What is on the stove? | What is left of sink? |
|---|---|---|---|
| Neural-Image-QA | wall decoration | tea kettle | tissue roll |
| Language only | books | tea kettle | towel |
| Ground truth answers | wall decoration | tea kettle | tissue roll |

Table 2. Examples of compound answer words.



| | How many lamps are there? | How many pillows are there on the bed? | How many pillows are there on the sofa? |
|---|---|---|---|
| Neural-Image-QA | 2 | 2 | 3 |
| Language only | 2 | 3 | 3 |
| Ground truth answers | 2 | 2 | 3 |

Table 3. Counting questions.



| | What color is the towel? | What color are the cabinets? | What is the colour of the pillows? |
|---|---|---|---|
| Neural-Image-QA | brown | brown | black, white |
| Language only | white | brown | blue, green, red |
| Ground truth answers | white | brown | black, red, white |

Table 4. Questions about color.



























| | What is hanged on the chair? | What is the object close to the sink? | What is the object on the table in the corner? |
|---|---|---|---|
| Neural-Image-QA | clothes | faucet | lamp |
| Language only | jacket | faucet | plant |
| Ground truth answers | clothes | faucet | lamp |

Table 5. Correct answers by our “Neural-Image-QA” architecture.



| | What are the things on the cabinet? | What is in front of the shelf? | How many burner knobs are there? |
|---|---|---|---|
| Neural-Image-QA | photo | chair | |
| Language only | photo | basket | 6 |
| Ground truth answers | photo | chair | |

Table 6. Correct answers by our “Neural-Image-QA” architecture.



| | What is the object close to the counter? | What is the colour of the table and chair? | How many towels are hanged? |
|---|---|---|---|
| Neural-Image-QA | sink | brown | 3 |
| Language only | stove | brown | 4 |
| Ground truth answers | sink | brown | 3 |

Table 7. Correct answers by our “Neural-Image-QA” architecture.


























| | What is on the right most side on the table? | What are the things on the coffee table? | What is in front of the table? |
|---|---|---|---|
| Neural-Image-QA | lamp | books | chair |
| Language only | machine | jacket | chair |
| Ground truth answers | lamp | books | chair |

Table 8. Correct answers by our “Neural-Image-QA” architecture.

| | What is on the left side of the white oven on the floor and on right side of the blue armchair? | What are the things on the cabinet? | What color is the frame of the mirror close to the wardrobe? |
|---|---|---|---|
| Neural-Image-QA | oven | chair, lamp, photo | |
| Language only | exercise equipment | candelabra | curtain |
| Ground truth answers | garbage bin | lamp, photo, telephone | white |

Table 9. Failure cases.





















